Scale-Adaptable Recrawl Strategies for DHT-Based Distributed Web Crawling System
Authors
Abstract
Large-scale distributed Web crawling systems that use voluntarily contributed personal computing resources allow small companies to build their own search engines at very low cost. The biggest challenge for such systems is implementing functionality equivalent to that of traditional search engines on top of a fluctuating distributed environment. One such function is incremental crawling, which requires recrawling each Web site according to the update frequency of its content. However, recrawl intervals calculated solely from the change frequencies of Web sites may mismatch the system's real-time capacity, leading to inefficient utilization of resources. Building on our previous work on a DHT-based Web crawling system, in this paper we propose two scale-adaptable recrawl strategies that address this issue. The proposed methods are evaluated through simulations based on real Web datasets and show satisfactory results.
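To make the capacity-mismatch problem concrete, the sketch below shows one naive way it can arise and be corrected: intervals derived purely from per-site change rates can demand more crawls per day than the system can deliver, so they are uniformly rescaled to fit the current capacity. This is an illustrative assumption of ours (function names, the 1/rate interval rule, and the uniform scaling are all hypothetical), not the strategies proposed in the paper.

```python
# Hypothetical sketch: fit change-rate-derived recrawl intervals to a
# fluctuating system capacity. All names and formulas are assumptions
# for illustration, not the paper's actual algorithm.

def ideal_intervals(change_rates):
    """Naive policy: recrawl interval (days) = 1 / estimated change rate (changes/day)."""
    return [1.0 / r for r in change_rates]

def scale_to_capacity(intervals, capacity_crawls_per_day):
    """Uniformly stretch intervals so total demand does not exceed capacity.

    The demanded crawl rate is sum(1/interval). If it exceeds capacity,
    every interval is multiplied by the overload factor; if capacity is
    ample, the ideal intervals are kept unchanged.
    """
    demand = sum(1.0 / t for t in intervals)    # crawls per day demanded
    factor = demand / capacity_crawls_per_day   # > 1 means overloaded
    return [t * max(factor, 1.0) for t in intervals]

rates = [2.0, 0.5, 0.1]             # estimated changes/day for three sites
ideal = ideal_intervals(rates)      # [0.5, 2.0, 10.0] days
# demand = 2 + 0.5 + 0.1 = 2.6 crawls/day; with capacity 1.3 the factor
# is 2.0, so every interval doubles and demand shrinks to exactly 1.3.
scaled = scale_to_capacity(ideal, capacity_crawls_per_day=1.3)
```

Uniform scaling preserves the relative ordering of sites by freshness need; the paper's strategies would additionally have to re-run this adjustment as nodes join and leave the volunteer pool.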
Similar Articles
A Torrent Recommender based on DHT Crawling
The DHT Mainline is a significant extension to the BitTorrent protocol; with several million users, it is the largest DHT network. This thesis uses the DHT Mainline to build a recommendation system for torrents. A program was written to crawl the entirety of the torrent search engine kickass.to, gathering metadata about torrents. The DHT Mainline was then crawled to search for...
Efficient Partitioning Strategies for Distributed Web Crawling
This paper presents a multi-objective approach to Web space partitioning, aimed at improving distributed crawling efficiency. The investigation is supported by the construction of two different weighted graphs. The first is used to model the topological communication infrastructure between crawlers and Web servers, and the second is used to represent the amount of link connections between servers'...
Crawling BitTorrent DHTs for Fun and Profit
This paper presents two kinds of attacks based on crawling the DHTs used for distributed BitTorrent tracking. First, we show how pirates can use crawling to rebuild BitTorrent search engines just a few hours after they are shut down (crawling for fun). Second, we show how content owners can use related techniques to monitor pirates’ behavior in preparation for legal attacks and negate any perce...
Ipmicra: Toward a Distributed and Adaptable Location Aware Web Crawler
Distributed crawling has shown that it can overcome important limitations of the centralized crawling paradigm. However, the distributed nature of current distributed crawlers is not fully utilized; the optimal benefits of this approach are usually limited to the sites hosting the crawler. In this work we propose IPMicra, a distributed location-aware web crawler that utilizes an IP ad...
DHT-Based Distributed Crawler
A search engine, like Google, is built using two pieces of infrastructure: a crawler that indexes the web and a searcher that uses the index to answer user queries. While Google's crawler has worked well, there is the issue of timeliness and the lack of control given to end-users to direct the crawl according to their interests. The interface presented by such search engines is hence very limite...